-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Linear Interpolation of nan
s via cupy
#8767
Linear Interpolation of nan
s via cupy
#8767
Conversation
python/cudf/cudf/core/dataframe.py
Outdated
method in {"index", "values"} | ||
and not self.index.is_monotonic_increasing | ||
): | ||
warnings.warn("Unsorted Index...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this? What should we put here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As usual, pandas
seems OK with some pretty nonsensical cases:
In [83]: pd.Series([2, None, 4, None, 2], index=[1, 2, 3, 2, 1]).interpolate('values')
Out[83]:
1 2.0
2 3.0
3 4.0
2 3.0
1 2.0
dtype: float64
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #8767 +/- ##
===============================================
Coverage ? 10.57%
===============================================
Files ? 116
Lines ? 19050
Branches ? 0
===============================================
Hits ? 2015
Misses ? 17035
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't 100% sure whether you were looking for a review yet or not since this is marked WIP but you posted questions, so I just went through and left some (hopefully helpful) comments.
nan
s via cupy
nan
s via cupy
Co-authored-by: Vyas Ramasubramani <[email protected]>
…into fea-linear-interp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I started this review this morning then forgot to submit. A few more changes to be made, but it's nearly there.
python/cudf/cudf/core/frame.py
Outdated
if not isinstance(data._index, cudf.RangeIndex): | ||
# that which was once sorted, now is not | ||
result = result._gather(perm_sort.argsort()) | ||
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not isinstance(data._index, cudf.RangeIndex): | |
# that which was once sorted, now is not | |
result = result._gather(perm_sort.argsort()) | |
return result | |
return result if isinstance(data._index, cudf.RangeIndex) else result._gather(perm_sort.argsort()) |
I assume this will need a line break somewhere to make black
happy.
python/cudf/cudf/core/frame.py
Outdated
result = interpolator(col, index=data._index) | ||
columns[colname] = result | ||
|
||
result = self.__class__(ColumnAccessor(columns), index=data._index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use _from_data
instead.
python/cudf/cudf/core/frame.py
Outdated
col = col.astype("float64").fillna(np.nan) | ||
|
||
# Interpolation methods may or may not need the index | ||
result = interpolator(col, index=data._index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result = interpolator(col, index=data._index) | |
columns[colname] = interpolator(col, index=data._index) |
Github won't let me highlight two lines here because there's a previous comment I guess, but you'll also need to delete the subsequent line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is left over from a lot of pdb
-ing around :(
python/cudf/cudf/core/frame.py
Outdated
) | ||
|
||
data = self | ||
columns = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move this to just below the interpolator =...
line so it's closer to where it's used.
@pytest.mark.parametrize("method", ["linear"]) | ||
@pytest.mark.parametrize("axis", [0]) | ||
def test_interpolate_dataframe(data, method, axis): | ||
# doesn't seem to work with NAs just yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an issue still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this with a more descriptive comment, nullable dtypes don't interpolate in pandas yet as there are some bugs it seems, our impl treats nulls and nans the same.
Co-authored-by: Vyas Ramasubramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on my end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good and neat! Great work, @brandon-b-miller!
rerun tests |
@gpucibot merge |
Adds Series and DataFrame level functions for linear interpolation of missing values, built around CuPy's
interp
method.Pandas
interpolate
API supports somewhat varied functionality for fillingNaN
s. It currently does not work for actual<NA>
values - pandas issue here.. That said one might expect both kinds of missing data to be treated equally for the purposes of interpolation, and this PR does that.While
cp.interp
is great for getting us off the ground, but only supports linear interpolation and its results aren't exactly what pandas produces. In particular pandas will not fillNaN
s at the start of the series, because the default value oflimit_direction
isforward
and the defaultlimit
isNone
which from my experimentation means 'unlimited'. This means that that despite this, theNaN
s at the end WILL get filled. This means we need to actually figure out where the first NaN is and mask out that part of the series withNaN
s.Closes #8685.